{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "DaXDovdJVMDo"
      },
      "source": [
        "# Embeddings Encoder Customization\n",
        "\n",
        "[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/whylabs/langkit/blob/main/langkit/examples/Custom_Encoder.ipynb)\n",
        "\n",
        "Some modules in LangKit, such as the __themes__ and __input_output__ module, use an encoder to convert text into embeddings. In this example, we will show how you can plug in your own encoder to LangKit.\n",
        "\n",
        "## Default Behavior\n",
        "\n",
        "Let's use the themes module to talk about the default behavior. If you simply import the module without additional configuration, the default encoder will be used, which is the `all-MiniLM-L6-v2` model from the [Sentence Transformers](https://www.sbert.net/) library.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "3rLjVWGhVMDq"
      },
      "outputs": [],
      "source": [
        "%pip install -U langkit[all] \n",
        "%pip install tensorflow tensorflow_hub"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 7,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "uFBm39ZZVMDq",
        "outputId": "ef4a6571-aa2c-4c43-ffbf-fbc52710280a"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "The similarity score with default encoder is:  0.9999999403953552\n"
          ]
        }
      ],
      "source": [
        "from langkit import themes\n",
        "\n",
        "similarity_score = themes.group_similarity(\"Sorry, but I can't assist you with that.\", \"refusal\")\n",
        "\n",
        "print(\"The similarity score with default encoder is: \", similarity_score)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ra5QuSbMVMDr"
      },
      "source": [
        "## Custom Encoder\n",
        "\n",
        "You can also pass your own encoder function to be used by LangKit. Let's define a simple function that takes a list of strings and returns a list of embeddings, one for each input string. Then, we pass that function to the `themes` initializer as the `custom_encoder` parameter."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 8,
      "metadata": {
        "id": "ltsMHWQhVMDr"
      },
      "outputs": [],
      "source": [
        "from typing import List\n",
        "def embed(texts: List[str]) -> List[List[float]]:\n",
        "    return [[0.2,0.2] for _ in texts]\n",
        "\n",
        "themes.init(custom_encoder=embed)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "NLo9HEosVMDr"
      },
      "source": [
        "Now, if we run the similarity calculation again, we see that the score is near 1.0. That makes sense, considering the embedding for every string is the same, and the cosine similarity between the same vector is 1.0."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 3,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "YWRipJ3WVMDr",
        "outputId": "a7471dd7-6e85-4194-b3b4-362d733002c4"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "The similarity score with the custom encoder is:  0.9999999403953552\n"
          ]
        }
      ],
      "source": [
        "similarity_score = themes.group_similarity(\"Sorry, but I can't assist you with that.\", \"refusal\")\n",
        "\n",
        "print(\"The similarity score with the custom encoder is: \", similarity_score)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "lmoY5TsBVMDr"
      },
      "source": [
        "## Universal Sentence Encoder\n",
        "\n",
        "Let's show another example with a real encoder. We will Google's Universal Sentence Encoder to encode sentences into vectors. We will use the [TensorFlow Hub](https://www.tensorflow.org/hub) to download the model, and pass the embed function into Langkit, just as before.\n",
        "\n",
        "> If your local environment doesn't have the required dependencies to use `tensorflow_hub`, we recommend running this example on Colab by clicking the button on the top of this example."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 9,
      "metadata": {
        "id": "9T4q0Yo_VMDr"
      },
      "outputs": [],
      "source": [
        "import tensorflow_hub as hub\n",
        "\n",
        "# Load pre-trained universal sentence encoder model\n",
        "use_embed = hub.load(\"https://tfhub.dev/google/universal-sentence-encoder/4\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "whrIj26IXtYf"
      },
      "source": [
        "Let's use `input_output` as an example this time, which will calculate the similarity score between a pair of prompt and response."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 10,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "McnUncp6VMDs",
        "outputId": "24223c7c-3667-4b10-9574-f01a52e474b4"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "The similarity score with the Universal Sentence Encoder is:  [0.08993373811244965]\n"
          ]
        }
      ],
      "source": [
        "from langkit.input_output import init, prompt_response_similarity\n",
        "\n",
        "init(custom_encoder=use_embed)\n",
        "\n",
        "similarity_score = prompt_response_similarity({\"prompt\": [\"What is the capital of France?\"], \"response\": [\"Mitochondria is the powerhouse of the cell\"]})\n",
        "\n",
        "print(\"The similarity score with the Universal Sentence Encoder is: \", similarity_score)"
      ]
    }
  ],
  "metadata": {
    "colab": {
      "provenance": []
    },
    "kernelspec": {
      "display_name": "langkit-rNdo63Yk-py3.8",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.8.10"
    },
    "orig_nbformat": 4
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
